ViTs for SITS: Vision Transformers for Satellite Image Time Series
In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a
fully-attentional model for general Satellite Image Time Series (SITS)
processing based on the Vision Transformer (ViT). TSViT splits a SITS record
into non-overlapping patches in space and time, which are tokenized and
subsequently processed by a factorized temporo-spatial encoder. We argue that,
in contrast to natural images, a temporal-then-spatial factorization is more
intuitive for SITS processing and present experimental evidence for this claim.
Additionally, we enhance the model's discriminative power by introducing two
novel mechanisms for acquisition-time-specific temporal positional encodings
and multiple learnable class tokens. The effect of all novel design choices is
evaluated through an extensive ablation study. Our proposed architecture
achieves state-of-the-art performance, surpassing previous approaches by a
significant margin on three publicly available SITS semantic segmentation and
classification datasets. All model, training, and evaluation code is made
publicly available to facilitate further research.
Comment: 11 pages, 5 figures, 2 tables
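To make the temporal-then-spatial factorization concrete, below is a minimal PyTorch sketch under assumed token shapes and dimensions; the module name, depths, and sizes are illustrative, and it omits the acquisition-time positional encodings and class tokens described above (the authors' released code is the authoritative reference).

```python
# Minimal sketch of a temporal-then-spatial factorized encoder for SITS.
# All shapes, depths, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FactorizedTemporoSpatialEncoder(nn.Module):  # hypothetical name
    def __init__(self, dim=128, heads=4, depth=2):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer(), num_layers=depth)
        self.spatial = nn.TransformerEncoder(layer(), num_layers=depth)

    def forward(self, tokens):
        # tokens: (batch, time, n_patches, dim) -- one token per
        # non-overlapping space-time patch of the SITS record.
        b, t, n, d = tokens.shape
        # Temporal stage: attend over acquisitions, independently per patch.
        x = tokens.permute(0, 2, 1, 3).reshape(b * n, t, d)
        x = self.temporal(x)
        # Spatial stage: attend over patch locations, independently per step.
        x = x.reshape(b, n, t, d).permute(0, 2, 1, 3).reshape(b * t, n, d)
        x = self.spatial(x)
        return x.reshape(b, t, n, d)

# Example: a year of 12 acquisitions over a 4x4 grid of patches.
enc = FactorizedTemporoSpatialEncoder()
out = enc(torch.randn(2, 12, 16, 128))
print(out.shape)  # torch.Size([2, 12, 16, 128])
```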
Context-self contrastive pretraining for crop type semantic segmentation
In this paper, we propose a fully supervised pre-training scheme based on
contrastive learning particularly tailored to dense classification tasks. The
proposed Context-Self Contrastive Loss (CSCL) learns an embedding space that
makes semantic boundaries pop up by means of a similarity metric between every
location in a training sample and its local context. For crop type semantic
segmentation from Satellite Image Time Series (SITS) we find performance at
parcel boundaries to be a critical bottleneck and explain how CSCL tackles the
underlying cause of that problem, improving state-of-the-art performance on
this task. Additionally, using images from the Sentinel-2 (S2) satellite
missions we compile the largest, to our knowledge, SITS dataset densely
annotated by crop type and parcel identities, which we make publicly available
together with the data generation pipeline. Using these data, we find that
CSCL improves all respective baselines even with minimal pre-training, and we
present a process for semantic segmentation at super-resolution that obtains
crop classes at a more granular level. The code and instructions to download
the data can be found at https://github.com/michaeltrs/DeepSatModels.
Comment: 15 pages, 17 figures
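As a rough illustration of the idea behind CSCL, here is a hedged PyTorch sketch in which a pixel's "context" is assumed to be the mean embedding of its 3x3 neighbourhood, and supervision comes from whether all neighbours share the pixel's label; the function name, margin, and context definition are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a context-self contrastive loss. The exact context
# definition, similarity metric, and margin in the paper may differ.
import torch
import torch.nn.functional as F

def cscl_loss(emb, labels, margin=0.5):  # hypothetical helper
    # emb: (B, D, H, W) per-pixel embeddings; labels: (B, H, W) class ids.
    # Assumed context: mean embedding of the 3x3 neighbourhood.
    ctx = F.avg_pool2d(emb, 3, stride=1, padding=1, count_include_pad=False)
    sim = F.cosine_similarity(emb, ctx, dim=1)  # (B, H, W)
    # A pixel counts as "interior" when its 3x3 neighbourhood shares its label.
    lab = labels.unsqueeze(1).float()
    nb_max = F.max_pool2d(lab, 3, stride=1, padding=1)
    nb_min = -F.max_pool2d(-lab, 3, stride=1, padding=1)
    interior = ((nb_max == lab) & (nb_min == lab)).squeeze(1)
    # Pull interior pixels toward their context; push boundary pixels away,
    # which is what makes semantic boundaries "pop up" in embedding space.
    pos = (1 - sim)[interior].mean() if interior.any() else sim.new_zeros(())
    neg = F.relu(sim - margin)[~interior].mean() if (~interior).any() else sim.new_zeros(())
    return pos + neg

# Example usage on random data: 64-d embeddings over a 32x32 tile, 5 classes.
loss = cscl_loss(torch.randn(2, 64, 32, 32), torch.randint(0, 5, (2, 32, 32)))
```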
Synthesising 3D facial motion from "In-the-Wild" speech
Synthesising 3D facial motion from speech is a crucial problem manifesting in
a multitude of applications such as computer games and movies. Recently
proposed methods tackle this problem in controlled recording conditions. In
this paper, we introduce the first methodology for 3D facial motion synthesis
from speech captured in arbitrary recording conditions ("in-the-wild") and
independent of the speaker. For our purposes, we captured 4D sequences of
people uttering 500 words contained in Lip Reading Words (LRW), a publicly
available large-scale in-the-wild dataset, and built a set of 3D blendshapes
appropriate for speech. We correlate the 3D shape parameters of the speech
blendshapes with the LRW audio samples by means of a novel time-warping
technique, named Deep Canonical Attentional Warping (DCAW), that can
simultaneously learn hierarchical non-linear representations and a warping path
in an end-to-end manner. We thoroughly evaluate our proposed methods and show
that a deep learning model can synthesise 3D facial motion while handling
different speakers and continuous speech signals in uncontrolled conditions.
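In the spirit of DCAW, the sketch below pairs two sequence encoders with a soft attention matrix that plays the role of a warping path between audio and blendshape sequences; the correlation objective is a simplified surrogate for the canonical-correlation term, and all names and dimensions are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of attention-based soft time-warping between two sequences:
# two encoders learn non-linear representations end-to-end, and a soft
# attention matrix stands in for the warping path.
import torch
import torch.nn as nn

class AttentionalWarper(nn.Module):  # hypothetical name
    def __init__(self, audio_dim=40, shape_dim=30, hidden=64):
        super().__init__()
        self.enc_a = nn.GRU(audio_dim, hidden, batch_first=True)  # audio view
        self.enc_b = nn.GRU(shape_dim, hidden, batch_first=True)  # blendshape view

    def forward(self, audio, shapes):
        ha, _ = self.enc_a(audio)   # (B, Ta, H)
        hb, _ = self.enc_b(shapes)  # (B, Tb, H)
        # Soft warping path: each shape frame attends over audio frames.
        attn = torch.softmax(hb @ ha.transpose(1, 2) / ha.size(-1) ** 0.5, dim=-1)
        warped_a = attn @ ha        # (B, Tb, H): audio re-timed to the shape axis
        return warped_a, hb

def correlation_loss(x, y):
    # Negative mean correlation between aligned features (a CCA surrogate).
    x = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-6)
    y = (y - y.mean(dim=1, keepdim=True)) / (y.std(dim=1, keepdim=True) + 1e-6)
    return -(x * y).mean()

# Example: 100 audio frames warped onto 60 blendshape frames.
warper = AttentionalWarper()
wa, hb = warper(torch.randn(2, 100, 40), torch.randn(2, 60, 30))
loss = correlation_loss(wa, hb)
```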